We derive expressions for the predictive information rate (PIR) for the class of autoregressive
Gaussian processes AR($N$), both in terms of the prediction coefficients and in terms of the power
spectral density. The latter result suggests a duality between the PIR and the multi-information rate
for processes with mutually inverse power spectra (i.e. with poles and zeros of the transfer function
exchanged). We investigate the behaviour of the PIR in relation to the multi-information rate for
some simple examples, which suggest, somewhat counter-intuitively, that the PIR is maximised for
very 'smooth' AR processes whose power spectra have multiple poles at zero frequency. We also obtain
results for moving-average Gaussian processes which are consistent with the duality conjectured
earlier. One consequence of this is that the PIR is unbounded for MA($N$) processes.



I. INTRODUCTION 

The predictive information rate (PIR) of a bi-infinite sequence of random variables $(\ldots, X_{-1}, X_0, X_1, \ldots)$ was defined by Abdallah and Plumbley [1] in the context of information dynamics, which is concerned with the application of information theoretic methods [2, 3] to the process of sequentially observing a random sequence while maintaining a probabilistic description of the expected future evolution of the sequence. An observer in this situation can maintain an estimate of its uncertainty about future observations (by computing various entropies) and can also estimate the information in each observation about the as-yet unobserved future given all the observations so far; this is the instantaneous predictive information or IPI. For stationary processes, the ensemble average of the IPI is the PIR. It is a measure of temporal structure that characterises the process as a whole, rather than on a moment-by-moment basis or for particular realisations of the process, in the same way that the entropy rate characterises its overall randomness. Abdallah and Plumbley [4] examined several process information measures and their interrelationships. Following the conventions established there, we let $\overleftarrow{X}_t = (\ldots, X_{t-2}, X_{t-1})$ denote the variables before time $t$, and $\overrightarrow{X}_t = (X_{t+1}, X_{t+2}, \ldots)$ denote those after $t$. For a process with a shift-invariant probability measure $\mu$, the predictive information rate $b_\mu$ is defined as a conditional mutual information

$$b_\mu = I(X_t; \overrightarrow{X}_t \mid \overleftarrow{X}_t) = H(\overrightarrow{X}_t \mid \overleftarrow{X}_t) - H(\overrightarrow{X}_t \mid X_t, \overleftarrow{X}_t). \tag{1}$$

Equation (1) says that the PIR is the average reduction in uncertainty about the future on learning $X_t$, given the past. In similar terms, three other process information measures can be defined: the entropy rate $h_\mu$, the multi-information rate $\rho_\mu$ and the erasure or residual entropy rate $r_\mu$, as follows:

$$h_\mu = H(X_t \mid \overleftarrow{X}_t), \tag{2}$$
$$\rho_\mu = I(X_t; \overleftarrow{X}_t) = H(X_t) - H(X_t \mid \overleftarrow{X}_t), \tag{3}$$
$$r_\mu = H(X_t \mid \overleftarrow{X}_t, \overrightarrow{X}_t). \tag{4}$$
FIG. 1. I-diagram representation of several information measures for stationary random processes. Each circle or oval represents a random variable or sequence of random variables relative to time $t = 0$. The circle represents the 'present'. Its total area is $H(X_0) = \rho_\mu + r_\mu + b_\mu$, where $\rho_\mu$ is the multi-information rate, $r_\mu$ is the residual entropy rate, and $b_\mu$ is the predictive information rate. The entropy rate is $h_\mu = r_\mu + b_\mu$.

These measures are illustrated in an information diagram, or I-diagram [5], in fig. 1, which shows how they partition the marginal entropy $H(X_t)$, the uncertainty about a single observation in isolation; this partitioning is discussed in depth by James et al. [6].



II. GAUSS-MARKOV PROCESSES 

A Gauss-Markov, or autoregressive Gaussian, process of order $N$ is a real-valued random process $(X_t)_{t \in \mathbb{Z}}$ on the domain of integers such that

$$X_t = U_t + \sum_{k=1}^{N} \phi_k X_{t-k}, \tag{5}$$

where the innovations $U_t$ form a sequence of independent and identically distributed Gaussian random variables with zero mean and variance $\sigma^2$, and the $\phi_k$ are the autoregressive or prediction coefficients. Thus, a realisation of the random process $X$ is the result of applying an order-$N$ infinite impulse response (IIR) filter to a realisation of the innovation sequence formed by the $U_t$. The class of such processes is known as AR($N$). If the autoregressive coefficients $\phi_k$ are such that the filter is stable, the process will be stationary and thus may have well defined entropy and predictive information rates. We will assume that this is the case.
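
The recursion (5) can be sketched directly in code. The following is a minimal illustration, not part of the original analysis: the function names `ar_filter` and `gaussian_innovations` are hypothetical, and the values of $X$ before the start of the list are assumed to be zero, so a burn-in period would be needed before the output resembles a stationary realisation.

```python
import random


def ar_filter(phi, innovations):
    """Apply X_t = U_t + sum_k phi[k-1] * X_{t-k} to a list of innovations.

    Assumption for illustration: X is taken to be zero before the first sample.
    """
    xs = []
    for t, u in enumerate(innovations):
        x = u
        for k, coeff in enumerate(phi, start=1):
            if t - k >= 0:
                x += coeff * xs[t - k]
        xs.append(x)
    return xs


def gaussian_innovations(n, sigma, seed=0):
    """Draw n independent zero-mean Gaussian innovations with standard deviation sigma."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


# An AR(2) realisation driven by unit-variance innovations.
realisation = ar_filter([0.5, -0.2], gaussian_innovations(1000, 1.0))
```

Feeding a unit impulse through `ar_filter` returns the impulse response of the IIR filter, which is a convenient way to check that a chosen set of coefficients decays rather than diverges.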
FIG. 2. Graphical model for a 2nd-order Gauss-Markov, or AR(2), process. The $X_t$ are the observed, real-valued random variables, while the $U_t$ are the unobserved innovations. Each $X_t$ is a deterministic (linear) function of its parents.



A. Entropy rate

From the defining equation (5) we can immediately see that $H(X_t \mid \overleftarrow{X}_t) = H(X_t \mid X_{t-N}, \ldots, X_{t-1}) = H(U_t)$, which depends only on $\sigma^2$, so

$$h_\mu = \tfrac{1}{2} \log 2\pi e \sigma^2. \tag{6}$$

It is known that the entropy rate of a stationary Gaussian process can also be expressed in terms of its power spectral density (PSD) function $S : \mathbb{R} \to \mathbb{R}$, which is defined as the discrete-time Fourier transform of the autocorrelation sequence $\gamma_k = \mathbb{E}\, X_t X_{t-k}$:

$$S(\omega) = \sum_{k=-\infty}^{\infty} \gamma_k e^{-i\omega k}, \qquad \gamma_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega) e^{i\omega k} \, d\omega. \tag{7}$$

Using methods of Toeplitz matrix analysis [7], the entropy rate can be shown, with suitable restrictions on the autocorrelation sequence, to be

$$h_\mu = \frac{1}{2} \left( \log 2\pi e + \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega) \, d\omega \right), \tag{8}$$

which is also known as the Kolmogorov-Sinai entropy for this process. Incidentally, this means that the variance of the innovations $\sigma^2$ can be expressed as

$$\sigma^2 = \exp \left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega) \, d\omega \right). \tag{9}$$

B. Multi-information rate

The multi-information rate of a stationary Gaussian process was found by Dubnov [8] to be expressible as

$$\rho_\mu = \frac{1}{2} \left( \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega) \, d\omega \right] - \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega) \, d\omega \right). \tag{10}$$

We can see how this was obtained by noting that the marginal variance $\mathbb{E}\, X_t^2 = \gamma_0$ can be computed from the spectral density function simply by setting $k = 0$ in the inverse Fourier transform (7), yielding

$$H(X_t) = \frac{1}{2} \left( \log 2\pi e + \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega) \, d\omega \right] \right). \tag{11}$$

Since $\rho_\mu = H(X_t) - h_\mu$ and $h_\mu$ is given by the Kolmogorov-Sinai entropy of the Gaussian process (8), Dubnov's expression (10) follows directly.

C. Predictive information rate

To derive an expression for the predictive information rate $b_\mu$, we first note that the model is $N$th-order Markov, since the trailing segment of the process $(X_{N+1}, X_{N+2}, \ldots)$ is conditionally independent of the leading segment $(\ldots, X_{-1}, X_0)$ given the intervening segment $(X_1, \ldots, X_N)$. Writing $X_{p:q}$ for the finite segment $(X_p, X_{p+1}, \ldots, X_q)$, it can be shown that

$$I(X_0; \overrightarrow{X}_0 \mid \overleftarrow{X}_0) = I(X_0; X_{1:N} \mid X_{-N:-1}). \tag{12}$$



Thus, to find the PIR, we need only consider the $2N + 1$ consecutive variables around $X_0$, namely $X_{-N:N}$. Expanding the conditional mutual information in terms of entropies, we obtain

$$b_\mu = H(X_{1:N} \mid X_{-N:-1}) - H(X_{1:N} \mid X_{-N:0}). \tag{13}$$

Since the segment $X_{-N:0}$ contains more than $N$ elements, the second term is just $N$ times the entropy rate, so

$$b_\mu = H(X_{1:N} \mid X_{-N:-1}) - N h_\mu. \tag{14}$$



To evaluate $H(X_{1:N} \mid X_{-N:-1})$, we note that, for continuous random variables $Y$ and $Z$,

$$H(Y \mid Z) = \int H(Y \mid Z = z) \, p(z) \, dz, \tag{15}$$

where $p(z)$ is $Z$'s probability density at $z$; that is, we find the entropy of $Y$ given particular values of $Z$, and then average over the possible values of $Z$. If we find that $H(Y \mid Z = z)$ is the same value independent of $z$, then $H(Y \mid Z)$ is trivially that value. This is indeed what we will find when we apply this approach here, and so we will examine the case where the variables $(X_{-N}, \ldots, X_{-1})$ have been observed with the values $(x_{-N}, \ldots, x_{-1})$ respectively. Under these conditions, we can, in effect, forget that $X_{-N:-1}$ are random variables and investigate the joint distribution of $X_{0:N}$ implicitly conditioned on the observation $X_{-N:-1} = x_{-N:-1}$. Referring back to (5), we may rewrite the recursive relation between random variables $X_j$ for $1 \le j \le N$ given observations $x_{-N:-1}$ as

$$X_j = U_j + \left( \sum_{i=1}^{j} \phi_i X_{j-i} \right) + \left( \sum_{i=j+1}^{N} \phi_i x_{j-i} \right), \tag{16}$$

with a special case for $X_0$:

$$X_0 = U_0 + \sum_{i=1}^{N} \phi_i x_{-i}. \tag{17}$$

With an eye to the final sums in both the above equations, let us also define $m_j$ for $0 \le j \le N$ as

$$m_j = \sum_{i=j+1}^{N} \phi_i x_{j-i}. \tag{18}$$
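Consider now the following transformation of variables:

$$Y_j = X_j - \sum_{i=1}^{j-1} \phi_i X_{j-i} - m_j. \tag{19}$$

Putting $\mathbf{X} = (X_1, \ldots, X_N)^T$, $\mathbf{Y} = (Y_1, \ldots, Y_N)^T$ and $\mathbf{m} = (m_1, \ldots, m_N)^T$, this can be written as a vector equation,

$$\mathbf{Y} = A \mathbf{X} - \mathbf{m}, \tag{20}$$

where the matrix $A$ is lower triangular with ones along the main diagonal:

$$A = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ -\phi_1 & 1 & 0 & \cdots & 0 \\ -\phi_2 & -\phi_1 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -\phi_{N-1} & -\phi_{N-2} & -\phi_{N-3} & \cdots & 1 \end{pmatrix}. \tag{21}$$

Equation (20) implies that $H(\mathbf{Y}) = H(\mathbf{X}) + \log |A|$, but since $A$ has the above form, $|A| = 1$ and so $H(\mathbf{X}) = H(\mathbf{Y})$.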



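Substituting (16) and (18) into (19) gives

$$Y_j = U_j + \phi_j X_0. \tag{22}$$

Expanding $X_0$ using (17) and writing in vector form,

$$\mathbf{Y} = \mathbf{U} - U_0 \mathbf{a} - m_0 \mathbf{a}, \tag{23}$$

where $\mathbf{U} = (U_1, \ldots, U_N)^T$ is a spherical Gaussian random vector and $\mathbf{a}$ is a constant vector with components $a_i = -\phi_i$. The determinant of the covariance of $\mathbf{Y}$ and hence the entropy $H(\mathbf{Y})$ can be found by exploiting the spherical symmetry of $\mathbf{U}$ and rotating into a frame of reference in which $\mathbf{a}$ is aligned with the first coordinate axis and therefore has the components $(\|\mathbf{a}\|, 0, \ldots)$. In this frame of reference, the covariance matrix of $\mathbf{Y}$ is

$$\Lambda = \sigma^2 \begin{pmatrix} 1 + \|\mathbf{a}\|^2 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$

and therefore its determinant is just $\sigma^{2N} (1 + \|\mathbf{a}\|^2)$. This gives us the entropy of $\mathbf{Y}$ and hence of $\mathbf{X}$:

$$H(\mathbf{X}) = H(\mathbf{Y}) = \tfrac{1}{2} \log \left[ (2\pi e \sigma^2)^N (1 + \|\mathbf{a}\|^2) \right]. \tag{24}$$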

By construction, $H(\mathbf{X}) = H(X_{1:N} \mid X_{-N:-1} = x_{-N:-1})$; that is, the entropy of $X_{1:N}$ conditioned on the particular observed values $x_{-N:-1}$, but since (24) is independent of those values, we may conclude that $H(X_{1:N} \mid X_{-N:-1}) = H(\mathbf{X})$ and substitute (24) and (6) into (14) to obtain

$$b_\mu = H(X_{1:N} \mid X_{-N:-1}) - N h_\mu = \tfrac{1}{2} \log \left[ (2\pi e \sigma^2)^N (1 + \|\mathbf{a}\|^2) \right] - \tfrac{1}{2} N \log 2\pi e \sigma^2. \tag{25}$$

Simplified and expressed in terms of the original filter coefficients $\phi_k$:

$$b_\mu = \frac{1}{2} \log \left( 1 + \sum_{k=1}^{N} \phi_k^2 \right). \tag{26}$$
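
The determinant step behind (24) to (26) can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it builds the covariance $\sigma^2 (I + \mathbf{a}\mathbf{a}^T)$ of $\mathbf{Y}$ explicitly, computes its determinant by Gaussian elimination, and compares the resulting value of $b_\mu$ with the closed form (26). All function names are hypothetical.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def pir_from_covariance(phi, sigma2=1.0):
    """b_mu in nats from det(Cov(Y)), with Cov(Y) = sigma2 * (I + a a^T) and a_i = -phi_i."""
    n = len(phi)
    a = [-p for p in phi]
    cov = [[sigma2 * ((1.0 if i == j else 0.0) + a[i] * a[j]) for j in range(n)]
           for i in range(n)]
    # H(Y) - N * h_mu: the (2 * pi * e) factors cancel, leaving only sigma2 terms.
    return 0.5 * math.log(determinant(cov)) - 0.5 * n * math.log(sigma2)


def pir_closed_form(phi):
    """b_mu in nats from the closed form (26)."""
    return 0.5 * math.log(1.0 + sum(p * p for p in phi))
```

For any coefficient vector the two functions should agree to within rounding error, and the result should not depend on `sigma2`, which reflects the invariance of the PIR to rescaling of the process.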



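Let us now consider the relationship between the PIR and power spectral density. For an autoregressive process, the PSD can be computed directly from the prediction coefficients via the filter transfer function, which is the z-transform of the filter impulse response. If we set $a_0 = 1$ and $a_k = -\phi_k$ for $1 \le k \le N$, and temporarily reuse the symbol $H(\cdot)$ to denote the transfer function, then,

$$H(z) = \frac{1}{\sum_{k=0}^{N} a_k z^{-k}} = \frac{1}{1 - \phi_1 z^{-1} - \cdots - \phi_N z^{-N}}, \tag{27}$$

and $S(\omega) = \sigma^2 |H(e^{i\omega})|^2$. Since all the coefficients are assumed real, this gives

$$S(\omega) = \frac{\sigma^2}{\left( \sum_{k=0}^{N} a_k e^{i\omega k} \right) \left( \sum_{j=0}^{N} a_j e^{-i\omega j} \right)}.$$

Now, consider the integral of $\sigma^2 / S(\omega)$ over one cycle of $\omega$ from $-\pi$ to $\pi$:

$$\int_{-\pi}^{\pi} \frac{\sigma^2}{S(\omega)} \, d\omega = \int_{-\pi}^{\pi} \sum_{k=0}^{N} \sum_{j=0}^{N} a_k a_j e^{i\omega (k - j)} \, d\omega = \sum_{k=0}^{N} \sum_{j=0}^{N} a_k a_j \, 2\pi \delta_{jk} = 2\pi \sum_{k=0}^{N} a_k^2.$$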



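Referring back to (26), this shows that, for AR($N$) processes at least, the predictive information rate can be expressed in terms of the power spectrum $S(\omega)$ as

$$b_\mu = \frac{1}{2} \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\sigma^2}{S(\omega)} \, d\omega \right]. \tag{28}$$

Substituting in (9) for $\sigma^2$, we get

$$b_\mu = \frac{1}{2} \left( \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{S(\omega)} \, d\omega \right] + \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega) \, d\omega \right).$$

As a final step, this can be written entirely in terms of the inverse power spectrum, in an expression which is an exact parallel of (10):

$$b_\mu = \frac{1}{2} \left( \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{S(\omega)} \, d\omega \right] - \frac{1}{2\pi} \int_{-\pi}^{\pi} \log \frac{1}{S(\omega)} \, d\omega \right), \tag{29}$$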

exposing an intriguing duality between the multi-information and predictive information rates on the one hand, and Gaussian processes whose power spectra are mutually inverse on the other. A similar duality was noted by Abdallah and Plumbley [1] in relation to the multi-information and the binding information in finite sets of discrete-valued random variables. Although derived for finite-order autoregressive processes, we conjecture that (29) may be valid for any Gaussian process for which the required integrals exist, and return to this topic later, in our analysis of moving-average processes.

FIG. 3. Multi-information rate $\rho_\mu$ and predictive information rate $b_\mu$ for AR(1) processes with prediction coefficient $\phi_1$. The grey line is the asymptote at $b_\mu = \frac{1}{2}$ a bit.

D. Residual or erasure entropy rate

Since the erasure entropy rate is $r_\mu = h_\mu - b_\mu$, we can write $r_\mu$ in terms of the power spectrum as follows:

$$r_\mu = \frac{1}{2} \log 2\pi e - \frac{1}{2} \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{S(\omega)} \, d\omega \right]. \tag{30}$$

This concurs with the results of Verdú and Weissman [9], which are presented there without proof. We outline a skeleton proof later in § IV C.

III. AUTOREGRESSIVE EXAMPLES

Here we compute $b_\mu$ and $\rho_\mu$ for some simple cases to get a feel for their range of variation. In all cases, the processes are normalised to unit variance, so that $\gamma_0 = \mathbb{E}\, X_t^2 = 1$ for all $t$, the marginal entropy is $H(X_t) = \frac{1}{2} \log 2\pi e$, and therefore $\rho_\mu + h_\mu = H(X_0)$ is constant.

A. AR(1) processes

The simplest case we can consider is that of the AR(1) processes, which, given the variance constraint, form a one-dimensional family parameterised by the prediction coefficient $\phi_1$. The generative equation is

$$X_t = U_t + \phi_1 X_{t-1}. \tag{31}$$

The process will be stationary only if the corresponding IIR filter is stable, which requires that $|\phi_1| < 1$. Multiplying (31) by $X_{t-k}$ and taking expectations yields

$$\mathbb{E}\, X_t X_{t-k} = \mathbb{E}\, U_t X_{t-k} + \phi_1 \, \mathbb{E}\, X_{t-1} X_{t-k},$$

from which we obtain the Yule-Walker equations relating the autocorrelation sequence $\gamma_k = \mathbb{E}\, X_t X_{t-k}$ and the prediction coefficient $\phi_1$:

$$\gamma_0 = \sigma^2 + \gamma_1 \phi_1, \qquad \gamma_1 = \gamma_0 \phi_1. \tag{32}$$

The variance constraint means that $\gamma_0 = 1$ and so we find that $\sigma^2 = 1 - \phi_1^2$. Since $\rho_\mu = H(X_0) - h_\mu$ and $H(X_0)$ is fixed at $\frac{1}{2} \log 2\pi e$, we obtain the following results:

$$\rho_\mu = -\tfrac{1}{2} \log (1 - \phi_1^2), \tag{33}$$
$$b_\mu = \tfrac{1}{2} \log (1 + \phi_1^2). \tag{34}$$

Given the stability constraints on $\phi_1$, both quantities are minimised when $\phi_1 = 0$, which corresponds to $X$ being a unit-variance white noise sequence. Both $\rho_\mu$ and $b_\mu$ increase as $\phi_1 \to \pm 1$: the multi-information rate diverges while the PIR tends to $\frac{1}{2} \log 2$ nats or half a bit. As $\phi_1 \to 1$, the process becomes Brownian noise, whose sequence of first differences is white noise. The marginal variance constraint means that the innovation variance $\sigma^2$ simultaneously tends to zero. However, since $b_\mu$ is invariant to rescaling of the process, the PIR of any (discrete-time) Brownian noise can be taken to be 0.5 bits per sample. If $\phi_1 \to -1$, the process no longer looks like Brownian noise, but can be obtained from one by reversing the sign of every other sample, that is, by applying the map $X_t \mapsto (-1)^t X_t$.
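
The limiting behaviour described above can be tabulated with a few lines of code. This is a small sketch of (33) and (34) converted to bits; the function name `ar1_rates_bits` is an assumption introduced here for illustration.

```python
import math


def ar1_rates_bits(phi1):
    """Return (rho_mu, b_mu) in bits for a unit-variance AR(1) process, per (33) and (34)."""
    if not -1.0 < phi1 < 1.0:
        raise ValueError("stationarity requires |phi1| < 1")
    rho = -0.5 * math.log2(1.0 - phi1 ** 2)
    pir = 0.5 * math.log2(1.0 + phi1 ** 2)
    return rho, pir


# rho grows without bound as phi1 approaches 1, while b stays below half a bit.
for phi1 in (0.0, 0.5, 0.9, 0.999):
    rho, pir = ar1_rates_bits(phi1)
    print(f"phi1={phi1:6.3f}  rho={rho:7.3f} bits  b={pir:6.3f} bits")
```

Because both expressions depend on $\phi_1$ only through $\phi_1^2$, the same values are obtained for $-\phi_1$, consistent with the sign-alternating map noted above.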



B. AR(2) processes 

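Second-order processes can be tackled in much the same way. In this case, the generative equation is

$$X_t = U_t + \phi_1 X_{t-1} + \phi_2 X_{t-2}, \tag{35}$$

and the Yule-Walker equations are

$$\begin{aligned} \gamma_0 &= \sigma^2 + \gamma_1 \phi_1 + \gamma_2 \phi_2, \\ \gamma_1 &= \gamma_0 \phi_1 + \gamma_1 \phi_2, \\ \gamma_2 &= \gamma_1 \phi_1 + \gamma_0 \phi_2. \end{aligned} \tag{36}$$

A little algebra eventually yields

$$\gamma_0 = \frac{\sigma^2 (1 - \phi_2)}{1 - \phi_1^2 - \phi_1^2 \phi_2 - \phi_2 - \phi_2^2 + \phi_2^3}, \tag{37}$$

and therefore

$$\rho_\mu = \frac{1}{2} \log \left[ \frac{1 - \phi_2}{1 - (\phi_1^2 + \phi_2)(1 + \phi_2) + \phi_2^3} \right], \tag{38}$$
$$b_\mu = \tfrac{1}{2} \log (1 + \phi_1^2 + \phi_2^2). \tag{39}$$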
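
The "little algebra" leading to (37) can be verified by substituting the result back into the Yule-Walker equations (36). The sketch below does this numerically; the helper names are hypothetical, and the coefficients in any call are assumed to lie inside the region of stability.

```python
import math


def ar2_autocovariance(phi1, phi2, sigma2=1.0):
    """Return (gamma_0, gamma_1, gamma_2) for a stable AR(2) process, using (36) and (37)."""
    denom = 1.0 - (phi1 ** 2 + phi2) * (1.0 + phi2) + phi2 ** 3
    gamma0 = sigma2 * (1.0 - phi2) / denom
    gamma1 = gamma0 * phi1 / (1.0 - phi2)      # second Yule-Walker equation
    gamma2 = gamma1 * phi1 + gamma0 * phi2     # third Yule-Walker equation
    return gamma0, gamma1, gamma2


def ar2_rates(phi1, phi2):
    """Return (rho_mu, b_mu) in nats, per (38) and (39); rho depends only on gamma_0 / sigma^2."""
    gamma0, _, _ = ar2_autocovariance(phi1, phi2, 1.0)
    rho = 0.5 * math.log(gamma0)
    pir = 0.5 * math.log(1.0 + phi1 ** 2 + phi2 ** 2)
    return rho, pir


def yule_walker_residual(phi1, phi2, sigma2=1.0):
    """Residual of the first Yule-Walker equation; it should vanish if (37) is correct."""
    g0, g1, g2 = ar2_autocovariance(phi1, phi2, sigma2)
    return g0 - (sigma2 + g1 * phi1 + g2 * phi2)
```

Setting `phi2 = 0` recovers the AR(1) results of the previous subsection, which provides a second consistency check.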



Fig. 4 illustrates how $\rho_\mu$ and $b_\mu$ vary with $\phi_1$ and $\phi_2$. The PIR is maximised when $\phi_1 = \pm 2$ and $\phi_2 = -1$, which corresponds to a transfer function $H(z)$ (as in equation 27) with two poles in the same place at $z = \pm 1$. Similar to the AR(1) case, as $(\phi_1, \phi_2)$ approaches $(2, -1)$, $\rho_\mu$ diverges and the process becomes non-stationary, but instead of becoming Brownian noise, it becomes the cumulative sum of a Brownian noise: a quick check will verify that the sequence of second differences is white noise.

FIG. 4. (a) Multi-information rate $\rho_\mu$ for AR(2) processes parameterised by prediction coefficients $\phi_1$ and $\phi_2$. Numerically computed contours of $\rho_\mu$ over a range from 0.05 to 4 are shown. The interior of the triangle $(0, 1)$, $(2, -1)$, $(-2, -1)$ is the region of stability of the process, and $\rho_\mu$ diverges to infinity at its edges. Contours of $b_\mu$ are not shown as they are simply circles centred at the origin. (b) Accessible region (shaded) of $(\rho_\mu, b_\mu)$ pairs corresponding to the interior of the triangle in (a), computed numerically from contours at many values of $\rho_\mu$. Upper and lower grey lines are upper and lower asymptotes at $\frac{1}{2} \log_2 6 \approx 1.29$ bits and $\frac{1}{2} \log_2 \frac{3}{2} \approx 0.29$ bits respectively.



C. AR($N$) processes with random poles

In our final example, we take a look at higher-order autoregressive processes. In order to ensure that we consider only stable processes, we generate them by randomly sampling their poles inside the unit circle in the complex plane. From the poles we compute the prediction coefficients by expanding the factorised form of the transfer function, which, for $N$ poles at $\zeta_1, \ldots, \zeta_N$, is

$$H(z) = \frac{z^N}{\prod_{k=1}^{N} (z - \zeta_k)}. \tag{40}$$

The autocorrelation sequence $\gamma_k$ can be computed from $\sigma^2$ and the prediction coefficients using what is essentially a generalisation of the methods used in § III A and § III B (as implemented in the MATLAB function rlevinson). From $\gamma_0 / \sigma^2$ we compute $\rho_\mu$. The points cover a region qualitatively similar to that shown in fig. 4(b), but with different upper and lower asymptotes. Our initial investigations suggest that the upper limit of $b_\mu$ is approached if all the poles approach 1 or $-1$, at which point the prediction coefficients are the binomial coefficients and are easily computed. For example, at $N = 8$, we obtain $b_\mu < \frac{1}{2} \log 12870$. As with AR(1) and AR(2), the resulting processes are such that the $N$th differences of the sequence are white noise, but since the variance of the innovations tends to zero, the processes themselves appear increasingly smooth and are dominated by low frequencies.

IV. MOVING-AVERAGE PROCESSES

A moving-average Gaussian process of order $N$ is a real-valued random process $(X_t)_{t \in \mathbb{Z}}$ such that

$$X_t = \sum_{k=0}^{N} b_k U_{t-k}, \tag{41}$$

where the $U_t$ form a sequence of independent Gaussian random variables with zero mean and variance $\sigma^2$. Thus, a realisation of the process $X$ is the result of applying an order-$N$ finite impulse response (FIR) filter with coefficients $b_k$ to a realisation of the sequence formed by the $U_t$. The class of such processes is known as MA($N$). Without loss of generality, we may assume $b_0 = 1$, since any overall scaling of the process can be absorbed into $\sigma^2$. We may also assume that none of the roots of the filter transfer function polynomial $B(z) = \sum_{k=0}^{N} b_k z^{-k}$ are outside the unit disk in the complex plane, by the following argument: assuming $b_0 = 1$, $B(z)$ can be expressed in terms of its $N$ roots $\beta_1, \ldots, \beta_N$ as

$$B(z) = \prod_{k=1}^{N} (1 - \beta_k z^{-1}). \tag{42}$$

The spectral density at angular frequency $\omega$ is therefore

$$S(\omega) = \sigma^2 |B(e^{i\omega})|^2 = \sigma^2 \prod_{k=1}^{N} |e^{i\omega} - \beta_k|^2. \tag{43}$$



FIG. 5. Graphical model for an MA(1), or first-order moving-average, Gaussian process. The $X_t$ are the observed, real-valued random variables, while the $U_t$ are unobserved. Each $X_t$ is a deterministic (linear) function of its parents.

The Gaussian process is uniquely determined by giving either its autocorrelation sequence or its spectral density function. If we move any of the roots $\beta_k$ without changing the value of $S(\omega)$ for any $\omega$, the FIR filter may be different but the process itself will remain unchanged. Suppose one of the roots is at $\zeta$ and $|\zeta| > 1$. Its contribution to the PSD is a factor of

$$|e^{i\omega} - \zeta|^2 = |\zeta|^2 \, |e^{i\omega} - \tilde{\zeta}|^2,$$

where $\tilde{\zeta} = 1 / \zeta^*$ is the reciprocal of the complex conjugate of $\zeta$ and hence inside the unit disk. Thus, the root $\zeta$ can be replaced with $\tilde{\zeta}$ and the only effect on the PSD is the introduction of the constant factor $|\zeta|^2$, which can be absorbed into $\sigma^2$. In this way, all the roots of $B(z)$ that are outside the unit disk can be moved inside without changing the statistical structure of the process. Noting that (41) can be written as

$$U_t = X_t - \sum_{k=1}^{N} b_k U_{t-k}, \tag{44}$$

we see that the sequence $(\ldots, U_{t-1}, U_t)$ can be computed from the sequence $(\ldots, X_{t-1}, X_t)$ via a stable IIR filter with the transfer function $1/B(z)$. These properties will be useful when we try to determine the process information measures of the MA($N$) process.
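
The root-reflection identity used in this argument is easy to confirm numerically. The following sketch, with hypothetical helper names, evaluates both sides of $|e^{i\omega} - \zeta|^2 = |\zeta|^2 |e^{i\omega} - 1/\zeta^*|^2$ at a handful of frequencies for an arbitrarily chosen root outside the unit disk.

```python
import cmath
import math


def reflect_root(zeta):
    """Map a root outside the unit disk to 1 / conj(zeta), which lies inside it."""
    return 1.0 / zeta.conjugate()


def psd_factor(root, omega):
    """Contribution |e^{i omega} - root|^2 of a single root to the PSD."""
    return abs(cmath.exp(1j * omega) - root) ** 2


def max_mismatch(zeta, omegas):
    """Largest absolute difference between the two sides of the reflection identity."""
    reflected = reflect_root(zeta)
    return max(
        abs(psd_factor(zeta, w) - abs(zeta) ** 2 * psd_factor(reflected, w))
        for w in omegas
    )


zeta = complex(1.5, 0.8)  # an arbitrary example root with |zeta| > 1
mismatch = max_mismatch(zeta, [0.0, 0.7, 2.0, math.pi])
```

The mismatch should be at the level of floating-point rounding, confirming that reflecting a root changes the PSD only by the constant factor $|\zeta|^2$.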

The first thing to note about this model is that it does not have the Markov conditional independence structure of the AR($N$) model. Consider the graphical model of an MA(1) process depicted in fig. 5: even though $X_2$ and $X_4$ are marginally independent (since their parent node sets are disjoint and independent), they become conditionally dependent if $X_3$ is observed, because the known value of $X_3$ means that $U_2$ and $U_3$ become functionally related. The same argument applies if an arbitrarily long sequence $X_{1:n}$ is observed: in this case, $X_0$ and $X_{n+1}$ become conditionally dependent given $X_{1:n}$. This lack of any finite-order Markov structure means that the measures $h_\mu$, $\rho_\mu$ and $b_\mu$ cannot be computed from the joint distribution of any finite segment of the sequence, say $X_{-\ell:\ell}$, as we did in § II, but can be obtained by using spectral methods to analyse the covariance structure in the limit $\ell \to \infty$. From (41), we obtain the autocorrelation sequence

$$\begin{aligned} \gamma_m = \mathbb{E}\, X_t X_{t-m} &= \mathbb{E} \left[ \sum_{k=0}^{N} b_k U_{t-k} \sum_{j=0}^{N} b_j U_{t-m-j} \right] \\ &= \sum_{k=0}^{N} \sum_{j=0}^{N} b_k b_j \sigma^2 \delta_{k, m+j} \\ &= \sigma^2 \sum_{k=m}^{N} b_k b_{k-m}, \end{aligned} \tag{45}$$

for $0 \le m \le N$, with $\gamma_{-m} = \gamma_m$. The sequence is therefore non-zero for at most $2N + 1$ values of $m$, from $-N$ to $N$. Hence, the covariance matrix $R = \mathbb{E}\, \mathbf{X} \mathbf{X}^T$ of the multivariate Gaussian $\mathbf{X} = (X_{-\ell}, \ldots, X_\ell)$, when $\ell > N$, will be a banded Toeplitz matrix. For example, for an MA(1) process it will be

$$R = \begin{pmatrix} \gamma_0 & \gamma_1 & & & \\ \gamma_1 & \gamma_0 & \gamma_1 & & \\ & \ddots & \ddots & \ddots & \\ & & \gamma_1 & \gamma_0 & \gamma_1 \\ & & & \gamma_1 & \gamma_0 \end{pmatrix}. \tag{46}$$
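
As a concrete illustration of (45) and (46), the sketch below computes the autocovariance sequence of an MA($N$) process from its filter coefficients and assembles the banded Toeplitz covariance matrix. The function names are assumptions introduced here, and the matrix is returned as a plain list of lists.

```python
def ma_autocovariance(b, sigma2=1.0):
    """Return [gamma_0, ..., gamma_N] per (45), where b = [b_0, ..., b_N]."""
    n = len(b) - 1
    return [
        sigma2 * sum(b[k] * b[k - m] for k in range(m, n + 1))
        for m in range(n + 1)
    ]


def toeplitz_covariance(gamma, size):
    """size x size matrix with entries gamma_{|i-j|}, and zero beyond lag N."""
    return [
        [gamma[abs(i - j)] if abs(i - j) < len(gamma) else 0.0 for j in range(size)]
        for i in range(size)
    ]


gamma = ma_autocovariance([1.0, 0.5])      # MA(1) with b_1 = 0.5
R = toeplitz_covariance(gamma, 5)          # tridiagonal, as in (46)
```

For an MA(1) process only the main diagonal and the first off-diagonals are non-zero, which is what makes the tridiagonal structure of (46) apparent.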



A. Entropy rate 

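In the case of MA processes and with our assumption that the roots of the transfer function are not outside the unit disk, the Kolmogorov-Sinai entropy (8) can be evaluated exactly by substituting in (43) and using Jensen's formula, which gives $\int_{-\pi}^{\pi} \log |e^{i\omega} - \zeta| \, d\omega = 0$ if $|\zeta| \le 1$:

$$\begin{aligned} \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega) \, d\omega &= \frac{1}{2\pi} \int_{-\pi}^{\pi} \log \left[ \sigma^2 \prod_{k=1}^{N} |e^{i\omega} - \beta_k|^2 \right] d\omega \\ &= \log \sigma^2 + \frac{2}{2\pi} \sum_{k=1}^{N} \int_{-\pi}^{\pi} \log |e^{i\omega} - \beta_k| \, d\omega \\ &= \log \sigma^2, \end{aligned}$$

and hence

$$h_\mu = \tfrac{1}{2} \log 2\pi e \sigma^2. \tag{47}$$

This is consistent with our earlier observation that the innovations up to and including time $t$ can be computed from the observations up to time $t$ by IIR filtering the observations: in this case, the conditional variance of the next observation is just the variance of $b_0 U_{t+1}$, which is $\sigma^2$.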



B. Multi-information rate 



From (41) and (45), the marginal variance is $\mathbb{E}\, X_t^2 = \gamma_0 = \sigma^2 \sum_{k=0}^{N} b_k^2$, so, with $b_0 = 1$, and $\rho_\mu = H(X_t) - h_\mu$, the multi-information rate is

$$\rho_\mu = \frac{1}{2} \log \left( 1 + \sum_{k=1}^{N} b_k^2 \right), \tag{48}$$

which is in agreement with Ihara's result [10, §2.2]. Note that this is dual to the result obtained for the predictive information rate in AR($N$) processes (26), in that the FIR filter coefficients $b_k$ have taken the place of the IIR filter coefficients $\phi_k$ or $a_k$.

C. Predictive information rate 

The PIR can be obtained from the erasure entropy rate $r_\mu$ using the relation $b_\mu = h_\mu - r_\mu$. Verdú and Weissman [9] state without proof that the erasure entropy rate of a Gaussian process with power spectral density $S(\omega)$ is

$$r_\mu = \frac{1}{2} \log 2\pi e - \frac{1}{2} \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{S(\omega)} \, d\omega \right], \tag{49}$$

which, in combination with (8), yields

$$b_\mu = \frac{1}{2} \left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega) \, d\omega + \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{S(\omega)} \, d\omega \right] \right), \tag{50}$$

which agrees with the expression we obtained earlier for AR processes. A skeleton of a proof of (49) can be obtained by considering the limit of $H(X_0 \mid X_{-\ell:-1}, X_{1:\ell})$ as $\ell \to \infty$. Let $\mathbf{X}^\ell$ be the random vector $(X_{-\ell}, \ldots, X_\ell)$ with covariance matrix $R_\ell$ constructed from the autocorrelation sequence as shown previously in (45) and (46). If $K^\ell = R_\ell^{-1}$ is the corresponding precision matrix, the probability density function $p_\ell : \mathbb{R}^{2\ell+1} \to \mathbb{R}$ for $\mathbf{X}^\ell$ is the multivariate Gaussian:

$$p_\ell(\mathbf{x}) \propto \exp \left( -\tfrac{1}{2} \mathbf{x}^T K^\ell \mathbf{x} \right). \tag{51}$$

If we index the elements of $\mathbf{x}$ and $K^\ell$ starting with $-\ell$ and running through to $\ell$, then it can easily be shown by examining the functional dependence of $p_\ell(\mathbf{x})$ on $x_0$ that the conditional density of the central variable $X_0$ given the values of all the others is Gaussian with variance $1 / K^\ell_{00}$, and hence the conditional entropy is $H(X_0 \mid X_{-\ell:-1}, X_{1:\ell}) = -\frac{1}{2} \log (K^\ell_{00} / 2\pi e)$. Now, since $R_\ell$ is real and symmetric, it will have $2\ell + 1$ orthogonal eigenvectors with real eigenvalues, and $K^\ell$ can be represented in terms of these (taken to be normalised) as

$$K^\ell_{jk} = \sum_{n=1}^{2\ell+1} \frac{1}{r_n} V_{jn} V^*_{kn}, \tag{52}$$

where $V_{jn}$ is the $j$th component of the $n$th eigenvector with eigenvalue $r_n$. If $R_\ell$ had been circulant as well as Toeplitz, its eigenvectors would have been complex exponentials of the form $V_{jn} = e^{-2\pi i j n / (2\ell+1)} / \sqrt{2\ell + 1}$, in which case, substitution into (52) would yield $\sum_n r_n^{-1} / (2\ell + 1)$ for all the diagonal elements. Instead, the standard approach [7] is to construct two infinite sequences of matrices which are asymptotically equivalent. The first sequence consists of covariance matrices $R_\ell$ as $\ell$ increases, i.e., $R_1$, $R_2$, etc. The second is a sequence of circulant approximations of the $R_\ell$. As $\ell \to \infty$, the sequences converge to each other (in the weak norm sense) and many properties of the $R_\ell$ converge to those of their circulant approximations. This does not prove that all diagonal elements of the inverse $R_\ell^{-1}$ converge in this way, and indeed, we would not expect them to for the extremal elements such as $K^\ell_{\ell\ell}$, as this would be inconsistent with the result for the entropy rate. However, numerical results suggest that for 'central' elements $K^\ell_{jj}$ such that both $j + \ell$ and $\ell - j$ tend to infinity as $\ell$ tends to infinity, we can assume that the values do converge as expected. In particular, for the middle element, we suppose that

$$\lim_{\ell \to \infty} K^\ell_{00} = \lim_{\ell \to \infty} \frac{1}{2\ell + 1} \sum_{n=1}^{2\ell+1} \frac{1}{r_n}. \tag{53}$$

This remains to be proved, but if we accept it, then by Szegő's theorem [7], which applies to such functions of the eigenvalues of a Toeplitz matrix, this converges to an integral expressed in terms of the spectral density function:

$$\lim_{\ell \to \infty} K^\ell_{00} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{S(\omega)} \, d\omega, \tag{54}$$

so we obtain the expected expression for the erasure entropy rate

$$r_\mu = \frac{1}{2} \log 2\pi e - \frac{1}{2} \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{S(\omega)} \, d\omega \right]. \tag{55}$$
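
The supposed convergence of the central element of the precision matrix can be probed numerically in the simplest case. The sketch below is an illustration only, not a proof: it assumes an MA(1) process, for which the covariance matrix is tridiagonal with $\gamma_0 = \sigma^2 (1 + b_1^2)$ and $\gamma_1 = \sigma^2 b_1$, and solves $R_\ell \mathbf{k} = \mathbf{e}_0$ with the Thomas algorithm to obtain $K^\ell_{00}$. For this spectrum the integral $\frac{1}{2\pi} \int 1/S(\omega) \, d\omega$ evaluates to $1 / (\sigma^2 (1 - b_1^2))$, which serves as the reference value. The function name is hypothetical.

```python
def central_precision_ma1(b1, ell, sigma2=1.0):
    """Central diagonal element of the inverse of the (2*ell+1)-square MA(1) covariance matrix.

    Solves R x = e_centre with the Thomas algorithm for tridiagonal systems
    and returns the central component of x.
    """
    n = 2 * ell + 1
    diag = sigma2 * (1.0 + b1 * b1)   # gamma_0
    off = sigma2 * b1                 # gamma_1
    rhs = [0.0] * n
    rhs[ell] = 1.0
    c = [0.0] * n                     # modified super-diagonal
    d = [0.0] * n                     # modified right-hand side
    c[0] = off / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):             # forward elimination
        denom = diag - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x[ell]


b1 = 0.5
reference = 1.0 / (1.0 - b1 * b1)     # (1/2pi) * integral of 1/S for sigma2 = 1
values = [central_precision_ma1(b1, ell) for ell in (1, 5, 30)]
```

In this example the central element approaches the reference value rapidly as $\ell$ grows, which is consistent with (54), although a single family of processes cannot establish the general claim.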



V. MOVING-AVERAGE EXAMPLES 

The simplest non-trivial moving-average process that we can consider is the MA(1) process

$$X_t = U_t + b_1 U_{t-1}, \tag{56}$$

where the sole parameter $b_1$ satisfies $|b_1| \le 1$, according to the assumptions described at the beginning of § IV. Using (48), we find that $\rho_\mu = \frac{1}{2} \log (1 + b_1^2)$, which is dual to the result (34) obtained for the PIR of the AR(1) process obtained by inverting the spectrum of this MA(1) process. The transfer function of the two-tap FIR filter from $U$ to $X$ is $H(z) = 1 + b_1 z^{-1}$. If we define $\tilde{H}(z) = 1 / H(z) = 1 / (1 + b_1 z^{-1})$, we can see that $\tilde{H}(z)$ is the transfer function of the 1st-order IIR filter associated with an AR(1) process, where $b_1$ plays the role of the prediction coefficient (up to a change of sign, which affects neither rate). Clearly, the spectrum of this process, call it $\tilde{X}$, will be the inverse of that of the original process, and we can use the results of § III along with the duality relationship we observed relating the multi-information and predictive information rates, to compute the multi-information and predictive information rates of the moving-average process $X$. Referring back to § III A, we obtain

$$\rho_\mu = \tfrac{1}{2} \log (1 + b_1^2), \tag{57}$$
$$b_\mu = -\tfrac{1}{2} \log (1 - b_1^2). \tag{58}$$

Rather than repeat the process of illustrating these equations, we refer the reader back to fig. 3: the relationship is the same except for swapping the $\rho_\mu$ and $b_\mu$ axis labels and replacing $\phi_1$ with $b_1$. Indeed, the same reasoning can be applied to higher-order moving-average processes, so we can reuse figure 4 for moving-average processes by swapping $\rho_\mu$ and $b_\mu$, and replacing the prediction coefficients $\phi_k$ with the moving-average coefficients $b_k$.
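
The closed forms (57) and (58) can be compared against the spectral expressions (10) and (50) by direct numerical integration. The sketch below uses a midpoint rule on $[-\pi, \pi]$, which is an implementation choice made here for illustration; the function names and grid size are likewise assumptions.

```python
import math


def spectral_rates(psd, n_grid=4096):
    """Approximate (rho_mu, b_mu) in nats from a PSD, via the spectral forms (10) and (50)."""
    step = 2.0 * math.pi / n_grid
    values = [psd(-math.pi + (i + 0.5) * step) for i in range(n_grid)]
    mean_s = sum(values) / n_grid                          # (1/2pi) * integral of S
    mean_log_s = sum(math.log(v) for v in values) / n_grid  # (1/2pi) * integral of log S
    mean_inv_s = sum(1.0 / v for v in values) / n_grid      # (1/2pi) * integral of 1/S
    rho = 0.5 * (math.log(mean_s) - mean_log_s)
    pir = 0.5 * (math.log(mean_inv_s) + mean_log_s)
    return rho, pir


def ma1_psd(b1, sigma2=1.0):
    """PSD of the MA(1) process (56): sigma2 * |1 + b1 * exp(-i omega)|^2."""
    return lambda omega: sigma2 * (1.0 + b1 * b1 + 2.0 * b1 * math.cos(omega))


b1 = 0.5
rho_numeric, pir_numeric = spectral_rates(ma1_psd(b1))
rho_closed = 0.5 * math.log(1.0 + b1 * b1)      # (57)
pir_closed = -0.5 * math.log(1.0 - b1 * b1)     # (58)
```

Passing the reciprocal spectrum to `spectral_rates` should return the same two numbers in swapped order, which is the duality between the multi-information and predictive information rates expressed in code.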

One implication of these results is that, even in the MA(1) process, the PIR approaches infinity as $b_1$ approaches $\pm 1$. In higher-order processes, the PIR diverges as the zeros of the transfer function approach the unit circle in the complex plane. In particular, the dual of the AR($N$) process identified in § III C, with all poles together at 1 or $-1$, is an MA($N$) process with all zeros at 1 or $-1$, and maximises $\rho_\mu$ as $b_\mu$ diverges. With all zeros at $-1$, the coefficients of the corresponding FIR filter are the binomial coefficients, and so as the order $N$ tends to infinity, the filter approximates a smoothing filter with a Gaussian impulse response.
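
The binomial-coefficient case can be made concrete with a short calculation. The sketch below, whose function names are hypothetical, expands $(1 + z^{-1})^N$ and sums the squared coefficients; for $N = 8$ this reproduces the figure of 12870 quoted in § III C, which by (48) and the duality above also gives $\rho_\mu = \frac{1}{2} \log 12870$ for the corresponding MA(8) process.

```python
import math


def all_zeros_at_minus_one(n):
    """Coefficients of (1 + z^{-1})^n, i.e. the binomial coefficients C(n, k)."""
    return [math.comb(n, k) for k in range(n + 1)]


def sum_of_squares(coeffs):
    """Sum of squared filter coefficients, the argument of the logarithm in (26) and (48)."""
    return sum(c * c for c in coeffs)


coeffs = all_zeros_at_minus_one(8)
total = sum_of_squares(coeffs)
rate_in_bits = 0.5 * math.log2(total)
```

The sum of squared binomial coefficients equals the central binomial coefficient $\binom{2N}{N}$, so the limiting rate grows roughly linearly with the order $N$.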

VI. DISCUSSION AND CONCLUSIONS 

We have found a closed-form expression for the predictive information rate in autoregressive Gaussian processes of arbitrary finite order, which is a simple function of the prediction coefficients. It can also be expressed as a function of the power spectral density of the process in a form which we conjecture may apply to arbitrary Gaussian processes and not just autoregressive ones. The functional form also suggests a duality between the PIR and multi-information rate, since the PIR of a process with power spectrum $S(\omega)$ equals the multi-information rate of a process with the inverse power spectrum $1/S(\omega)$.

The fact that the stationary AR(1) and AR(2) processes maximising the PIR turn out to be, in the limit, Brownian motion and its (discrete-time) integral is intriguing and perhaps counter-intuitive: in order to preserve finite variance, both processes have vanishingly small innovations, with $\sigma^2$ tending to zero as the limit is approached, and therefore 'look smooth'. Indeed, as the order $N$ is increased, the results of § III C suggest that this pattern continues, with the PIR-maximising processes being increasingly 'smooth' and having power spectra more and more strongly peaked at $\omega = 0$. The PIR, originally proposed [1] as a potential measure of complexity or 'interestingness' (for which purpose it seems a plausible candidate, at least for discrete-valued processes), is telling us that these very 'smooth' Gaussian processes are somehow the most 'interesting'.

The difficulty is presented even more starkly in the case of moving-average processes, where the PIR is unbounded, and we are forced to conclude that a single observation can yield infinite information about the unobserved future. Once again, we find that very 'smooth' looking processes can have arbitrarily high predictive information rates.

The reason for this, we suggest, lies in the assumption that variables in a real-valued random sequence can be observed with infinite precision. Under these conditions, the tiny innovations observed in the unit-variance almost-Brownian noise of AR(1) when $\phi_1$ approaches 1 are just as measurable as the macroscopic innovations in the non-Brownian case and are significant and informative in a predictive sense, because every innovation is preserved into the infinite future in the form of an additive shift to all subsequent values in the sequence. In addition, as soon as we have infinite precision measurements, we open the door to the possibility of infinite information; hence the divergence of $\rho_\mu$ and $b_\mu$ in these limiting cases. This rather unphysical situation can be remedied if we recognise that, in physically realisable systems, the variables can only be observed with finite precision, either by explicitly modelling a quantisation error or by introducing some 'observation noise', for example, by allowing infinite precision observations only of $Z_t = X_t + N_t$, where the $N_t$ are independent and Gaussian with some variance $\sigma_n^2$. In this case, each observation can only yield a finite amount of information about $X_t$, and it will no longer be possible to use infinitesimal variations to carry information about the future because they will be swamped by the observation noise. Recognising that what we are talking about here is essentially a hidden Markov model, we aim to establish these ideas on a more rigorous footing in future work.